Remaking a peer's original design to improve the charts' clarity and aesthetics, and creating an alternative design where needed.
In this Take-home Exercise 2, I have chosen one of my classmates' Take-home Exercise 1 submissions and analysed its charts in terms of clarity and aesthetics. I have also remade the original designs using the data visualisation principles and best practices learnt in the previous two classes.
Before we get started, it is important to ensure that the required R packages have been installed. If they have, we load them. If they have yet to be installed, we install them and then load them into the R environment. The required packages are tidyverse, ggplot2, dplyr and patchwork.
The code chunk below is used to install and load the required packages onto RStudio.
packages <- c('tidyverse', 'ggplot2', 'dplyr', 'patchwork')
for (p in packages) {
  # install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
The code chunk below imports Participants.csv from the data folder into R by using read_csv() of readr and saves it as a tibble data frame called data.
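The import step can be sketched as follows. This is a minimal sketch, not necessarily the original chunk: the relative path data/Participants.csv is an assumption based on the description above, and import_participants() is a hypothetical helper introduced only for illustration.

```r
library(readr)

# Hypothetical helper: the default path is an assumption based on
# "Participants.csv from the data folder".
import_participants <- function(path = "data/Participants.csv") {
  read_csv(path)  # read_csv() returns a tibble
}

# Only run the import when the assumed file is actually present.
if (file.exists("data/Participants.csv")) {
  data <- import_participants()
  print(head(data))
}
```

read_csv() guesses column types from the file contents, so a column holding TRUE/FALSE values such as haveKids is normally parsed as logical.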
  participantId householdSize haveKids age      educationLevel interestGroup   joviality
1             0             3     TRUE  36 HighSchoolOrCollege             H 0.001626703
2             1             3     TRUE  25 HighSchoolOrCollege             B 0.328086500
3             2             3     TRUE  35 HighSchoolOrCollege             A 0.393469590
4             3             3     TRUE  21 HighSchoolOrCollege             I 0.138063446
5             4             3     TRUE  43           Bachelors             H 0.857396691
6             5             3     TRUE  32 HighSchoolOrCollege             D 0.772957791
The original design is shown below
A bar chart is ideal here, as it displays the count of residents in each age category. Yet, the chart can be improved on the following criteria:
An interesting insight from the graph can be used as the main title instead of a conventional graph name, as it makes readers curious to dive deeper.
Axes should be given new labels rather than reusing column names, since a column name such as cnt is not meaningful to readers.
Tick marks can be superfluous on a categorical scale. Here ages are grouped into bins, so tick marks on the x-axis are not required.
A main title at the center is more appealing than one at the left.
It is easier for users to read and interpret the graph if its axis labels are written horizontally instead of vertically. At the least, they can be tilted slightly where space is constrained.
Since the data labels are already shown at the top of the bars, the grid lines are redundant.
Rough sketch of proposed design is shown below
The age values are recoded into multiple levels such as '18-20', '21-25', '26-30', '31-35', '36-40', '41-45', '46-50', '51-55', '56-60' using the code chunk below. This can be performed using cut(), which converts numeric values to factors.
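The recoding step can be sketched as follows. This is a minimal sketch under stated assumptions: the column being binned is age, the break points are inferred from the level labels listed above, and the small data frame is a made-up sample rather than the real Participants data.

```r
# Made-up sample; in the post, `data` is the imported Participants tibble.
data <- data.frame(age = c(18, 20, 21, 36, 43, 60))

# Assumed break points, inferred from the labels. cut() uses right-closed
# intervals by default, so (17,20] covers ages 18 to 20, (20,25] covers 21 to 25, etc.
age_breaks <- c(17, 20, 25, 30, 35, 40, 45, 50, 55, 60)
age_labels <- c("18-20", "21-25", "26-30", "31-35", "36-40",
                "41-45", "46-50", "51-55", "56-60")

data$ageGroup <- cut(data$age, breaks = age_breaks, labels = age_labels)
data
```

Any age that falls outside the break points becomes NA, which is why a later step filters on non-missing ageGroup values.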
After performing necessary modifications, the final code and design are as follows:
data <- data %>% filter(!is.na(ageGroup)) # keep rows with a non-missing age group
p1 <- ggplot(data, aes(x = ageGroup)) +
  geom_bar() +
  ylim(0, 150) +
  # print the count above each bar; after_stat() replaces the deprecated stat()
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  labs(y = 'No. of\n Residents', title = "Residents of age 46-50 are predominant", x = 'Age Group') +
  theme(axis.title.y = element_text(angle = 0), axis.ticks.x = element_blank(), panel.background = element_blank(),
        axis.line = element_line(color = 'grey'), plot.title = element_text(hjust = 0.5),
        axis.title.y.left = element_text(vjust = 0.5), axis.text = element_text(face = "bold"))
p1
The original designs are shown below
A line chart is usually used to show a trend over time. As the objective of this chart is to visualise the proportion of residents who have kids relative to the total residents of each age group, a stacked bar chart is better suited to the task.
Values in percentages rather than decimals give the audience a clearer picture and help them understand the underlying insights from the graph.
Derived values reveal more interesting patterns than absolute values. Though this graph conveys the number of residents of a specific age group with kids versus the total residents, the groups are easier to compare with percentage information.
Rough sketch of proposed design is shown below
The TRUE and FALSE values of the haveKids column are replaced by With Kids and Without Kids respectively.
# works whether haveKids was read as logical or as character
data <- data %>%
  mutate(haveKids = ifelse(haveKids == "TRUE", "With Kids", "Without Kids"))
The proportion of residents with kids in each age category is computed using the code chunk below. The group_by() function groups the data frame by multiple columns, here ageGroup and haveKids, and the tally() function counts the number of rows in each group.
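# count residents per age group and kids status
df <- data %>%
  group_by(ageGroup, haveKids) %>%
  tally()
# express each count as a rounded percentage of its age group total
df <- df %>%
  group_by(ageGroup) %>%
  mutate(total = sum(n), prop = round(n * 100 / total)) %>%
  ungroup()
head(df)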
# A tibble: 6 x 5
  ageGroup haveKids         n total  prop
  <fct>    <chr>        <int> <int> <dbl>
1 18-20    With Kids       26    72    36
2 18-20    Without Kids    46    72    64
3 21-25    With Kids       33   112    29
4 21-25    Without Kids    79   112    71
5 26-30    With Kids       33   118    28
6 26-30    Without Kids    85   118    72
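As a quick check of the arithmetic, the sketch below rebuilds the first two age groups from the counts printed above and recomputes the percentages. The toy data frame and the names toy and toy_prop are made up for illustration, and count() is used here as a shorthand for group_by() followed by tally().

```r
library(dplyr)

# Toy data rebuilt from the printed counts (first two age groups only).
toy <- data.frame(
  ageGroup = rep(c("18-20", "21-25"), times = c(72, 112)),
  haveKids = c(rep(c("With Kids", "Without Kids"), times = c(26, 46)),
               rep(c("With Kids", "Without Kids"), times = c(33, 79)))
)

toy_prop <- toy %>%
  count(ageGroup, haveKids) %>%   # one row per ageGroup/haveKids pair, with count n
  group_by(ageGroup) %>%
  mutate(total = sum(n), prop = round(n * 100 / total)) %>%
  ungroup()

toy_prop
```

For the 18-20 group, 26 * 100 / 72 is about 36.1 and 46 * 100 / 72 is about 63.9, which round to the 36 and 64 shown in the output. Because each value is rounded independently, the percentages within a group are not guaranteed to add up to exactly 100.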
This dataset contains NA values, so let's filter them out using filter().
The code chunk below produces the chart showing the number of residents with and without kids.
p2 <- ggplot(data, aes(x = ageGroup, fill = haveKids)) +
  geom_bar() +
  ylim(0, 150) +
  # stack the labels with the bars so each count sits inside its own segment
  geom_text(stat = 'count', aes(label = after_stat(count)), position = position_stack(vjust = 0.5)) +
  labs(y = 'No. of\n Residents', title = "Residents of age 46-50 are predominant", x = 'Age Group', fill = "Kids") +
  theme(axis.title.y = element_text(angle = 0), axis.ticks.x = element_blank(),
        panel.background = element_blank(), axis.line = element_line(color = 'grey'), plot.title = element_text(hjust = 0.5),
        axis.title.y.left = element_text(vjust = 0.5), axis.text = element_text(size = 10, face = "bold"))
p2
Our next objective is to visualise the data as a 100% stacked bar chart with percentage information. The code below accomplishes this.
p3 <- ggplot(data = df, aes(x = ageGroup,
                            y = prop,
                            fill = haveKids)) +
  geom_col() +
  # percentage label centred inside each stacked segment
  geom_text(aes(label = paste0(prop, "%")),
            position = position_stack(vjust = 0.5), size = 3) +
  xlab("Age Group") +
  ylab("% of \n Residents") +
  ggtitle("Proportion of residents with & without kids") +
  guides(fill = guide_legend(title = "Kids")) +
  # apply the complete theme first so the tweaks below are not overwritten
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5, size = 13),
        legend.title = element_text(size = 9),
        legend.text = element_text(size = 7),
        axis.text = element_text(face = "bold"),
        axis.text.x = element_text(angle = 0),
        axis.ticks.x = element_blank(),
        axis.title.y = element_text(angle = 0),
        axis.title.y.left = element_text(vjust = 0.5))
p3
The original design is shown below
This colorful chart attracts the user at first sight and gives an outline of the happiness index of residents in each age category as well as their education level. Yet, the chart can be improved on the following criteria:
Labeling the education level right next to each line makes the chart much easier to read.
Soft, natural colors can be used to display most of the information, and bright or dark colours can be used to highlight specific information that requires greater attention.
A line chart is usually used to show a trend over time. As the objective of this chart is to outline the happiness index of residents with respect to education level and age category, a box plot serves the purpose better.
After performing necessary modifications, the final code and design are as follows:
p4 <- ggplot(data = data,
             aes(y = joviality, x = ageGroup)) +
  geom_boxplot() +
  labs(y = 'Happiness \nIndex', title = "Residents of which age & education qualification are more jovial?", x = 'Age Category') +
  theme(axis.title.y = element_text(angle = 0), axis.ticks.x = element_blank(), panel.background = element_blank(),
        axis.line = element_line(color = 'grey'), plot.title = element_text(hjust = 0.5),
        axis.title.y.left = element_text(vjust = 0.5), text = element_text(size = 15, face = "bold"),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  # one panel per education level, ordered from lowest to highest
  facet_wrap(~ factor(educationLevel, levels = c('Low', 'HighSchoolOrCollege', 'Bachelors', 'Graduate')))
p4
The original design is shown below
The main chart has no title. A title is an important component of any graph.
An annotation can be added to explain the significant finding of the chart.
An enlightening data visualisation is incomplete without well-labeled data points or data values. Count or percentage values on the bars help viewers understand the graph and draw insights. Numbers grab attention quickly, and comparing them is easier than looking up the corresponding values on the y-axis.
Axes should be given new labels rather than reusing column names, because a label such as No. of Residents conveys more meaning than count.
Tick marks can be superfluous on a categorical scale. Here interest groups are categorical, so tick marks on the x-axis are not required.
Vertical gridlines are not necessary in this chart as they do not convey any additional information. Horizontal gridlines are sufficient to read the height of the bars.
Rough sketch of proposed design is shown below
After performing necessary modifications, the final code and design are as follows:
p5 <- ggplot(data, aes(x = interestGroup)) +
  geom_bar(fill = 'lightblue') +
  # print the count above each bar
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  labs(y = 'No. of\n Residents', title = "Which Interest Group is predominant?", x = 'Interest Group') +
  theme(axis.title.y = element_text(angle = 0), axis.ticks.x = element_blank(), panel.background = element_blank(),
        axis.line = element_line(color = 'grey'), plot.title = element_text(hjust = 0.5),
        axis.title.y.left = element_text(vjust = 0.5), text = element_text(size = 12, face = "bold"))
p5
The original design is shown below
The size of the individual charts can be increased to give a clearer picture.
An annotation can be added to explain the significant finding of the chart.
Charts with the same y values can be plotted side by side with common axes to use the space wisely.
It is easier for users to read and interpret the graph if its axis labels are written horizontally instead of vertically. Here the age group categories can be tilted slightly.
Three individual charts are combined using patchwork (https://cran.r-project.org/web/packages/patchwork/vignettes/patchwork.html). The main title for the chart is created using plot_annotation(), and its alignment is adjusted using theme(). The code chunk below accomplishes all of the above formatting.
# patch1, patch3 and patch4 are the three individual charts being combined
(patch1 + patch3) / patch4 +
  plot_annotation(title = "Overview of Demographics of Ohio Residents") &
  theme(plot.title = element_text(hjust = 0.5),
        axis.text = element_text(size = 12, face = "bold"))
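To make the layout operators concrete, here is a minimal, self-contained sketch with toy plots. The data and the names toy, top_left, top_right, bottom and combined are made up for illustration and are not the charts from this post.

```r
library(ggplot2)
library(patchwork)

# Toy plots standing in for the real charts (hypothetical names and data).
toy <- data.frame(group = c("A", "B", "C"), value = c(3, 5, 2))
top_left  <- ggplot(toy, aes(x = group, y = value)) + geom_col()
top_right <- ggplot(toy, aes(x = group, y = value)) + geom_point()
bottom    <- ggplot(toy, aes(x = value, y = group)) + geom_col()

# `+` places plots side by side, `/` stacks rows,
# and `&` applies the theme to every plot in the composition.
combined <- (top_left + top_right) / bottom +
  plot_annotation(title = "Toy layout: two charts on top, one below") &
  theme(plot.title = element_text(hjust = 0.5))

combined
```

Using & rather than + for the theme matters here: with +, the theme would only be added to the last plot in the composition, whereas & reaches all of the plots, including the overall title created by plot_annotation().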
Take-home Exercise 2 helped me understand the significance of using the right visualisations and of approaching charts from the audience's perspective. I also learnt more about efficient coding from my peers' work. The exercise gave me the chance to spend more time understanding ggplot2 (the grammar of graphics) and adding layers to build an insightful chart.